Analyzing The Community Structure Of Web-like Networks: Models And Algorithms
This dissertation investigates the community structure of web-like networks (i.e., large, random, real-life networks such as the World Wide Web and the Internet). Recently, it has been shown that many such networks have a locally dense, globally sparse structure, with certain small, dense subgraphs occurring much more frequently than they do in classical Erdős–Rényi random graphs. This peculiarity, commonly referred to as community structure, has been observed in seemingly unrelated networks such as the Web, email networks, citation networks, and biological networks. The pervasiveness of this phenomenon has led many researchers to believe that such cohesive groups of nodes might represent meaningful entities. For example, in the Web such tightly-knit groups of nodes might represent pages with a common topic or geographical location, while in neural networks they might represent evolved computational units. The notion of community has emerged in an effort to formalize the empirical observation of the locally dense, globally sparse structure of web-like networks. In the broadest sense, a community in a web-like network is defined as a group of nodes that induces a dense subgraph which is sparsely linked with the rest of the network. Due to a wide array of envisioned applications, ranging from crawlers and search engines to network security and network compression, there has recently been widespread interest in finding efficient community-mining algorithms. In this dissertation, the community structure of web-like networks is investigated by a combination of analytical and computational techniques. First, we consider the problem of modeling web-like networks. In recent years, many new random graph models have been proposed to account for recently discovered properties of web-like networks that distinguish them from classical random graphs.
The vast majority of these random graph models take into account only the addition of new nodes and edges. Yet, several empirical observations indicate that deletion of nodes and edges occurs frequently in web-like networks. Inspired by such observations, we propose and analyze two dynamic random graph models that combine node and edge addition with a uniform and a preferential deletion of nodes, respectively. In both cases, we find that the random graphs generated by such models follow power-law degree distributions (in agreement with the degree distributions of many web-like networks). Second, we analyze the expected density of certain small subgraphs, such as defensive alliances on three and four nodes, in various random graph models. Our findings show that while in the binomial random graph the expected density of such subgraphs is very close to zero, in some dynamic random graph models it is much larger. These findings agree with our results obtained by computing the number of communities in some Web crawls. Next, we investigate the computational complexity of the community-mining problem under various definitions of community. Assuming the definition of community as a global defensive alliance or a global offensive alliance, we prove, using transformations from the dominating set problem, that finding optimal communities is an NP-complete problem. These and other similar complexity results, coupled with the fact that many web-like networks are huge, indicate that it is unlikely that fast, exact sequential algorithms for mining communities can be found. To handle this difficulty, we adopt an algorithmic definition of community and a simpler version of the community-mining problem, namely: find the largest community to which a given set of seed nodes belongs.
We propose several greedy algorithms for this problem. The first proposed algorithm starts out with a set of seed nodes, the initial community, and then repeatedly selects some nodes from the community's neighborhood and pulls them into the community. In each step, the algorithm uses the clustering coefficient, a parameter that measures the fraction of the neighbors of a node that are themselves neighbors, to decide which nodes from the neighborhood should be pulled into the community. This algorithm has time complexity of order O(nΔ), where n denotes the number of nodes visited by the algorithm and Δ is the maximum degree encountered. Thus, assuming a power-law degree distribution, this algorithm is expected to run in near-linear time. The proposed algorithm achieved good accuracy when tested on some real and computer-generated networks: the fraction of community nodes classified correctly is generally above 80% and often above 90%. A second algorithm, based on a generalized clustering coefficient in which not only the first neighborhood is taken into account but also the second, the third, etc., is also proposed. This algorithm achieves better accuracy than the first one but also runs slower. Finally, a randomized version of the second algorithm, which improves the time complexity without significantly affecting the accuracy, is proposed. The main target application of the proposed algorithms is focused crawling - the selective search for web pages that are relevant to a pre-defined topic
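The greedy expansion described above can be sketched in a few lines of Python. The abstract fixes neither the exact selection rule nor a threshold, so both are illustrative assumptions here: this sketch pulls in every frontier node whose clustering coefficient clears a cutoff.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves adjacent."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def grow_community(adj, seeds, threshold=0.5):
    """Greedily pull high-clustering neighbors into the seed community."""
    community = set(seeds)
    grew = True
    while grew:
        grew = False
        frontier = {u for v in community for u in adj[v]} - community
        for u in sorted(frontier):
            if clustering_coefficient(adj, u) >= threshold:
                community.add(u)
                grew = True
    return community

# Toy network: a 4-clique {0,1,2,3} bridged to a triangle {4,5,6}.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (4, 6), (5, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

print(sorted(grow_community(adj, {0, 1})))  # → [0, 1, 2, 3]
```

Starting from seeds {0, 1}, the clique members (clustering coefficient 1.0 and 0.5) are absorbed, while the bridge node 4 (coefficient 1/3) is left out, so the expansion stops at the clique boundary.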
Concordance and predictive value of two adverse drug event data sets
Background: Accurate prediction of adverse drug events (ADEs) is an important means of controlling and reducing drug-related morbidity and mortality. Since no single “gold standard” ADE data set exists, a range of different drug safety data sets are currently used for developing ADE prediction models. There is a critical need to assess the degree of concordance between these various ADE data sets and to validate ADE prediction models against multiple reference standards. Methods: We systematically evaluated the concordance of two widely used ADE data sets – Lexi-comp from 2010 and SIDER from 2012. The strength of the association between ADE (drug) counts in Lexi-comp and SIDER was assessed using Spearman rank correlation, while the differences between the two data sets were characterized in terms of drug categories, ADE categories and ADE frequencies. We also performed a comparative validation of the Predictive Pharmacosafety Networks (PPN) model using both ADE data sets. The predictive power of PPN on each of the two validation sets was assessed using the area under the receiver operating characteristic curve (AUROC). Results: The correlations between the counts of ADEs and drugs in the two data sets were 0.84 (95% CI: 0.82-0.86) and 0.92 (95% CI: 0.91-0.93), respectively. Relative to an earlier snapshot of Lexi-comp from 2005, Lexi-comp 2010 and SIDER 2012 introduced a mean of 1,973 and 4,810 new drug-ADE associations per year, respectively. The difference between the two data sets was most pronounced for Nervous System and Anti-infective drugs, Gastrointestinal and Nervous System ADEs, and postmarketing ADEs. A minor difference of 1.1% was found in the AUROC of PPN when SIDER 2012 was used for validation instead of Lexi-comp 2010. Conclusions: The ADE and drug counts in the Lexi-comp and SIDER data sets were highly correlated, and the choice of validation set did not greatly affect the overall prediction performance of PPN.
Our results also suggest that it is important to be aware of the differences that exist among ADE data sets, especially in modeling applications focused on specific drug and ADE categories
Pharmacointeraction Network Models Predict Unknown Drug-Drug Interactions
Drug-drug interactions (DDIs) can lead to serious and potentially lethal adverse events. In recent years, several drugs have been withdrawn from the market due to interaction-related adverse events (AEs). Current methods for detecting DDIs rely on the accumulation of sufficient clinical evidence in the post-market stage – a lengthy process that often takes years, during which time numerous patients may suffer from the adverse effects of the DDI. Detection methods are further hindered by the extremely large combinatorial space of possible drug-drug-AE combinations. There is therefore a practical need for predictive tools that can identify potential DDIs years in advance, enabling drug safety professionals to better prioritize their limited investigative resources and take appropriate regulatory action. To meet this need, we describe Predictive Pharmacointeraction Networks (PPINs) – a novel approach that predicts unknown DDIs by exploiting the network structure of all known DDIs, together with other intrinsic and taxonomic properties of drugs and AEs. We constructed an 856-drug DDI network from a 2009 snapshot of a widely-used drug safety database and used it to develop PPIN models for predicting future DDIs. We compared the DDIs predicted solely from these 2009 data with newly reported DDIs that appeared in a 2012 snapshot of the same database. Using a standard multivariate approach to combine predictors, the PPIN model achieved an AUROC (area under the receiver operating characteristic curve) of 0.81 with a sensitivity of 48% at a specificity of 90%. An analysis of DDIs by severity level revealed that the model was most effective for predicting “contraindicated” DDIs (AUROC = 0.92) and less effective for “minor” DDIs (AUROC = 0.63). These results indicate that network-based methods can be useful for predicting unknown drug-drug interactions
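The PPIN model combines several predictors; purely as an illustration of the network ingredient, the sketch below ranks non-interacting drug pairs by a common-neighbors score on a toy DDI graph. The drug names and edges are hypothetical examples, not curated safety data, and the score is a generic link-prediction heuristic, not the paper's full multivariate model.

```python
def common_neighbors_score(adj, a, b):
    """Number of drugs known to interact with both a and b."""
    return len(adj.get(a, set()) & adj.get(b, set()))

# Toy DDI network standing in for a snapshot of a safety database.
known = [("warfarin", "aspirin"), ("warfarin", "fluconazole"),
         ("aspirin", "ibuprofen"), ("fluconazole", "ibuprofen"),
         ("warfarin", "ibuprofen")]
adj = {}
for a, b in known:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Rank all non-adjacent pairs; high scorers are candidate future DDIs.
drugs = sorted(adj)
candidates = [(common_neighbors_score(adj, a, b), a, b)
              for i, a in enumerate(drugs) for b in drugs[i + 1:]
              if b not in adj[a]]
for score, a, b in sorted(candidates, reverse=True):
    print(score, a, b)  # the one missing pair, with 2 shared partners
```

The idea mirrors the abstract's premise: pairs embedded in a dense neighborhood of known interactions are more likely to be reported as interacting in a later snapshot.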
Evaluation of a Graph-based Topical Crawler
Abstract – Topical (or focused) crawlers have become important tools in dealing with the massiveness and dynamic nature of the World Wide Web. Guided by a data mining component that monitors and analyzes the boundary of the set of crawled pages, a focused crawler selectively seeks out pages on a pre-defined topic. Recent research indicates that both the textual content of web pages and the structural information enclosed in the Web graph need to be exploited in order to build high-quality focused crawlers. While a variety of text-based and graph-based measures of similarity that can direct a focused crawler toward relevant pages have been developed, much remains to be done toward formally evaluating and ranking the effectiveness of various focused crawling algorithms. Inspired by a recent and comprehensive evaluation framework for focused crawlers, we analyze the performance of a graph-based algorithm and compare it with two other algorithms: a breadth-first one and a text-based, best-first one. The results suggest that our graph-based algorithm is faster and only slightly less effective than the text-based, best-first algorithm, while significantly outperforming the breadth-first one
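The best-first strategy compared above reduces to a priority-queue loop: always expand the highest-scoring frontier page. In this sketch the link graph and relevance scores are placeholders for the text- or graph-based similarity measures the paper evaluates.

```python
import heapq

def best_first_crawl(links, score, seeds, budget):
    """Best-first crawl: repeatedly expand the highest-scoring frontier page."""
    frontier = [(-score(s), s) for s in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    visited = []
    seen = set(seeds)
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return visited

# Toy web graph; scores stand in for topic-relevance estimates.
links = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
relevance = {"A": 0.9, "B": 0.2, "C": 0.8, "D": 0.1, "E": 0.7}
print(best_first_crawl(links, relevance.get, ["A"], budget=4))
# → ['A', 'C', 'E', 'B']
```

Replacing the heap with a FIFO queue yields the breadth-first baseline, so the two comparison algorithms differ only in their frontier ordering.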
Mining Parameters That Characterize The Communities In Web-Like Networks
Community mining in large, complex, real-life networks such as the World Wide Web has emerged as a key data mining problem with important applications. In recent years, several graph theoretic definitions of community, generally motivated by empirical observations and intuitive arguments, have been put forward. However, a formal evaluation of the appropriateness of such definitions has been lacking. We present a new framework developed to address this issue, and then discuss a particular implementation of this framework. Finally, we present a set of experiments aimed at evaluating the effectiveness of two specific graph theoretic structures – alliance and near-clique – in capturing the essential properties of communities. © 2006 IEEE
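For concreteness, one of the structures evaluated above can be checked directly. Assuming the standard graph-theoretic definition, a set S is a defensive alliance if every member, counting itself, has at least as many neighbors inside S as outside it:

```python
def is_defensive_alliance(adj, s):
    """Check the (assumed standard) defensive-alliance condition:
    for every v in S, |N(v) ∩ S| + 1 >= |N(v) \\ S|."""
    s = set(s)
    for v in s:
        inside = sum(1 for u in adj[v] if u in s)
        outside = len(adj[v]) - inside
        if inside + 1 < outside:
            return False
    return True

# Triangle {0, 1, 2}, each member with one pendant outside neighbor.
adj = {0: {1, 2, 3}, 1: {0, 2, 4}, 2: {0, 1, 5},
       3: {0}, 4: {1}, 5: {2}}
print(is_defensive_alliance(adj, {0, 1, 2}))  # → True
print(is_defensive_alliance(adj, {0}))        # → False
```

A near-clique would instead be tested by a density threshold on the induced subgraph; both checks are linear in the size of S's neighborhood.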
A Birth-Death Dynamic Model Of Scale-Free Networks
We study a dynamic model of scale-free networks which incorporates not only the birth of vertices and edges but also their death. We analyze the degree distribution of this model by employing a mean-field approach and numerical simulations. Copyright 2005 ACM
Techniques For Analyzing Dynamic Random Graph Models Of Web-Like Networks: An Overview
Various random graph models have recently been proposed to replicate and explain the topology of large, complex, real-life networks such as the World Wide Web and the Internet. These models are surveyed in this article. Our focus has primarily been on dynamic random graph models that attempt to account for the observed statistical properties of web-like networks through certain dynamic processes guided by simple stochastic rules. Particular attention is paid to the equivalence between mathematical definitions of dynamic random graphs in terms of inductively defined probability spaces and algorithmic definitions of such models in terms of recursive procedures. Several techniques that have been employed for studying dynamic random graphs – both heuristic and analytic – are expounded. Each technique is illustrated through its application in analyzing various graph parameters, such as degree distribution, degree correlation between adjacent nodes, clustering coefficient, distribution of node-pair distances, and connected-component size. A discussion of the most recent salient work and a comprehensive list of references in this rapidly-expanding area are included. © 2007 Wiley Periodicals, Inc
Preferential Deletion In Dynamic Models Of Web-Like Networks
In this paper a discrete-time dynamic random graph process is studied that interleaves the birth of nodes and edges with the death of nodes. In this model, at each time step either a new node is added or an existing node is deleted. A node is added with probability p together with an edge incident on it. The node at the other end of this new edge is chosen based on a linear preferential attachment rule. A node (and all the edges incident on it) is deleted with probability q = 1 - p. The node to be deleted is chosen based on a probability distribution that favors small-degree nodes, in view of recent empirical findings. We analyze the degree distribution of this model and find that the expected fraction of nodes with degree k in the graph generated by this process decreases asymptotically as k^(-1 - 2p/(2p-1)). © 2007 Elsevier B.V. All rights reserved
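A birth-death process of this flavor is easy to simulate and compare against the analysis. This is a sketch under stated assumptions: the deletion weight 1/(deg+1) is an illustrative choice for "favors small-degree nodes" (the paper's exact distribution may differ), and a degree-zero guard is added so preferential attachment stays well defined after deletions isolate nodes.

```python
import random

def simulate(p, steps, seed=1):
    """Each step: with prob p add a node with one preferentially
    attached edge; otherwise delete a node, favoring small degree."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}          # start from a single edge
    next_id = 2
    for _ in range(steps):
        if rng.random() < p or len(adj) < 3:
            # birth: attach the new node with prob proportional to degree
            nodes = [v for v in adj if adj[v]]
            if nodes:
                weights = [len(adj[v]) for v in nodes]
                target = rng.choices(nodes, weights=weights)[0]
            else:
                target = rng.choice(list(adj))  # all isolated: pick uniformly
            adj[next_id] = {target}
            adj[target].add(next_id)
            next_id += 1
        else:
            # death: weight 1/(deg+1) favors small-degree nodes
            nodes = list(adj)
            weights = [1.0 / (len(adj[v]) + 1) for v in nodes]
            victim = rng.choices(nodes, weights=weights)[0]
            for u in adj.pop(victim):
                adj[u].discard(victim)
    return adj

adj = simulate(p=0.8, steps=3000)
degs = sorted((len(nbrs) for nbrs in adj.values()), reverse=True)
print(len(adj), degs[:5])  # ~ (2p-1)*steps nodes; a few large hubs
```

With p = 0.8 the graph grows at rate 2p - 1 = 0.6 nodes per step on average, and the heavy tail of the sampled degrees is consistent with the power-law exponent derived in the paper (which equals 3 in the pure-growth limit p = 1, matching the Barabási-Albert model).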